Name Variant Extraction from Individual Handle Identifiers

ABSTRACT

A method and apparatus for name variant extraction from individual handle identifiers uses a sequential extraction process to construct contextual information. Last name data, first/middle name data, initials, nicknames, and vanity names, along with numerical information indicating dates, may all be captured in extracting information about an individual associated with a particular handle. When multiple possible interpretations result from the analysis, those interpretations are ranked using optimality rules. The resulting data may be used to look up additional information in a consumer database in order to structure a targeted marketing message to the individual associated with the handle.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. provisional patent application No. 61/876,980, filed on Sep. 12, 2013, and entitled “Name Variant Extraction from Individual Handle Identifiers.” Such application is incorporated herein by reference in its entirety.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

Not applicable.

BACKGROUND OF THE INVENTION

Businesses are often faced with the task of associating consumer-specific attributes that have importance with respect to marketing campaigns to individuals whose relevant data as known to the business contains incomplete and/or indirect contact information. Without the ability to associate this consumer data to specific, well-defined individuals, the value of this data is significantly diminished. Important examples in online marketing environments include situations where the only available information is a “handle” for the individual, such as an email address or an identifier used in various social media channels, such as Twitter handles and Facebook user names.

Handles may be parsed in order to look for identifying information. Existing parsing applications often focus exclusively on the identification of first, middle, and last names embedded in an input string, and use fuzzy string matching to find variations of traditional names. Although such approaches may work well for strings primarily composed of names, most of these handles contain limited traditional name components and/or phrases that identify traits and characteristics of the individual. Very small changes in a given name can result in a word/phrase that clearly describes a personal attribute, and hence the use of fuzzy string matching in these cases may yield poor results.

Marketing efforts vary between the identification of a specific individual for direct marketing and the association of groups of individuals that share interests for online marketing. The association of these handles with a particular individual, or their association with traits associated with certain types of individuals, would be of great value in marketing campaigns, such as the formulation of marketing messages targeted to such individuals.

BRIEF SUMMARY OF THE INVENTION

The present invention in certain embodiments extracts meaningful information from handles to produce a composite interpretation that offers significant insight into expected consumer-specific attributes such as name, gender, ethnicity, and associations to other individuals. Cultural and demographic information as well as country, location affiliation, and language may also be extracted. By extracting an optimal set of name “variants” from these handle identifiers, insights about the associated individual may be found. In certain embodiments, the invention employs a handle parser that offers a best-effort identification and interpretation of the name and name-like components of the handle. It also distinguishes this data from “sentinel” characters that often delimit different contextual components of the handle. Finally, as many handles are interpreted differently by multiple individuals, the parser in certain embodiments does not return a single “best” interpretation but rather a ranked list of possible interpretations of the found attributes from which an end user can choose the one that is most consistent with the perceived context.

These and other features, objects and advantages of the present invention will become better understood from a consideration of the following detailed description of certain embodiments and appended claims in conjunction with the drawings as described following:

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 is a diagram showing a parsing process for handles according to certain embodiments of the present invention.

FIG. 2 is a diagram showing functional components of a system according to certain embodiments of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT(S)

Before the present invention is described in further detail, it should be understood that the invention is not limited to the particular embodiments described, and that the terms used in describing the particular embodiments are for the purpose of describing those particular embodiments only, and are not intended to be limiting, since the scope of the present invention will be limited only by the claims.

The most obvious attributes that may be used to gain insight into a particular individual associated with a handle are first, middle, and last names. These may appear in a handle, but are often presented in an abbreviated or variant form. For example, an initial may be used for a first, middle, or last name, or a nickname or name variant (“Barbra” for “Barbara”, “Willy” for “William”) may be presented in place of the first or middle name. Nevertheless, when a name is represented in this manner, it can still often be identified as such by means of a strong co-locational or other linguistic relationship with previously identified potential name components. For example, first names are primarily found “near” last names, where “near” can mean either “separated by a few sentinel characters” or “separated by very few identified entity or vanity phrases.” For example, the handle josephvee@hotmail.com is an email address that may be interpreted as referring to a “Joseph V.” The “V.” may be understood as an initial for a last name. Likewise, “pthemanjohnson_gohogs.com” may be understood to refer to a person named “P. Johnson.” In this case, the first name is represented by an initial, with a vanity name (“theman”) separating this initial from the last name.

In a similar manner, the actual number of potential first and last names within a handle is often bounded in terms of meaningful interpretation of the handle, and these bounds can be used to interpret the potential meaning of identified name phrases. For example, the phrase “kelly_marie_(—)10” contains two traditionally female first names in close proximity. Since the phrase does not contain a “relationship connective” between the two first names (such as “and”, “luvs,” or “bff”), these two name components can be interpreted as a first name/middle name component, but the possibility that the two identified names are first names of different individuals cannot be completely ignored. Likewise, without the occurrence of a relationship connective it is quite rare to see multiple last names in a handle that are not joined by a hyphen or ampersand, and an interpretation of more than two identified last names within a handle is extremely difficult without a rich context from the remaining portion of the handle. In certain such instances one or more of these last name components can be interpreted as a first name, often with a much lower but non-zero historical frequency.

As noted above, many name phrases have historically been used as both a first and last name, and in these cases both the positioning of the names as well as historical and demographic frequencies of the use of the names are combined to help rank the most likely interpretation of the phrase. Sometimes two name phrases are constructed from such names, such as “derekleonard” and “terrytalley”, and in these cases there can be four different meaningful interpretations (first name/middle name, first name/last name, last name/first name, and last name/last name). Each of these interpretations is identified and passed on to the ranking portion of the final possible interpretations of the handle.

In order to identify name types such as these, several public and private name dictionaries (tables) are employed. In addition to offering name-to-name variant relationships, these tables provide gender, ethnicity, and/or country of origin information related to the identified name. Also, they provide frequency of use information that can be used in the name type disambiguation mentioned above. In addition, recently in the U.S. there has been a significant shift in first name choices for babies that cannot be adequately reflected by means of the U.S. Census aggregate name data tables. Many of these “newer” names are variants of more traditional ones, but fuzzy string matching approaches often create a large degree of ambiguity and misinterpretation in other types of meaningful phrases in the handles. Hence gender specific first name tables are created from public tables of the most popular first names over the last 20 years. For some sets of handles whose users are primarily from a young population (for example Twitter) these tables can take precedence over the more historical frequency tables, whereas for other handle sources whose primary users are older (e-mail) the historical tables should be given higher priority.

The interpretation of identified potential name phrases depends heavily on a broader contextual understanding of the components of the handle. This context is often constructed by descriptive and connective phrases that can be captured by the use of dictionaries and interpreted in conjunction with the name phrases. This context can disambiguate phrases that can represent popular names, common entities, relationships, and/or actions (such as “love” and “joy”). In addition to this disambiguation step, the “non-name” phrases often represent non-name personal attributes of the individual that are useful for categorizing the individual for marketing efforts.

One type of frequently used phrase is a “vanity” name that often indicates a special interest or demographic attribute of the associated individual. Instances of such vanity names include descriptive words/phrases such as “soccermom” and “sexxy” as well as names from antiquity, mythology, and/or modern pop culture. These types of names are sometimes used for a consistency check of the collection of captured name components in terms of inferred personal attributes such as gender or ethnicity. In the example of “P. Johnson” described above, the vanity name “theman” within the handle is a likely indicator that the individual associated with this handle is male; such information may be valuable since the first initial “P.” does not provide gender information. Similarly, the identified phrase “soccermom” indicates that the associated individual is probably female, has one or more children within an age range that have a particular interest in soccer and a more general interest in athletics/sports participation. One or more dictionaries of relevant vanity names will be used in the identification of such vanity names.

Another type of contextual phrase that is useful to capture is an “entity” phrase which identifies a specific or general location (“bavaria”, “WindyCity”, “bigD”), affiliation with an institution or sports team (“gohogs”, “lakersfan”, “aggiemom”) or a specific interest (“hockeyrulez!”, “music”, “party_fool”). As noted before, such captured phrases can validate or enhance personal attributes such as gender, ethnicity, interests and location. Multiple dictionaries are used to identify such entity phrases (state, city, city nickname, institutions, institution nicknames, locations such as “bay area” and “rocky_mtn”) as well as entity phrase “auxiliary” words such as “go,” “boy,” “fan,” “grad,” and “born.” As words like “love” can be interpreted in a variety of ways in the contextually limited scope of handles (name phrase, relationship between two name phrases, entity phrase auxiliary word) it is important to identify the type of co-locational words to these, and hence the scope of entity dictionaries needs to be quite broad.

Some handles contain relationship information between multiple instances of name components that represent different individuals such as “suzylovesbill4ever.” These relations are important to identify since “Suzy Love” is a potential meaningful full name contained in the handle but clearly cannot be meaningfully inferred as such from the handle. Hence it is important to identify “connectives” and their abbreviated variants within the handle. These connectives include “and”, “'n”, “loves”, and “luvs” and are stored within a single dictionary.

Finally, sequences of two through eight digits can offer temporal information about an individual in terms of a date or year. Such sequences of digits should be viewed as potential “temporal names” that may offer consistency validation or additional evidence of specific attributes. These temporal names often refer to birthdates, birth years, graduation dates, and dates of other significant events in the individual's or related individual's life. The possible contextual interpretation of the actual date and/or year often depends on the relationship of the date to the current date as well as the linguistic or co-locational ties to identified entity phrases. For example, the handle “paul_g_(—)09” can be interpreted as representing an individual whose name is Paul G. and whose (probable high school) graduation year was 2009 (instead of a birth year). Similarly, the handle “wife&mother_JL_(—)7876” can be interpreted as identifying a married woman with children with initials JL and whose birthdate is Jul. 8, 1976 as such date abbreviations frequently occur in handles. When a substring of two to eight consecutive digits is found, it must be determined if that substring can represent a date in terms of its expressive power to identify a month, day, and/or year. Any possible set of ordering of these three date components can be chosen as legitimate patterns, including the option to allow all such orderings.

If such date interpretations are found, the associated digit string is marked with the temporal name meanings. For example, the string “999” can be interpreted as “September, 1999”, the string “120513” as “Dec. 5, 2013”, and “1123” as “November 23”. Some digit strings can be interpreted as a temporal string in multiple valid ways. For example, the digit string “11176” can be interpreted as the temporal name “Jan. 11, 1976” or “Nov. 1, 1976,” and each interpretation can be reported. On the other hand the strings “88888” and “767676” do not possess a valid month/day/year interpretation and hence would not be marked as a temporal name. Finally, the number of days in each month and the identification of leap years are used in the interpretation of valid temporal names.
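The temporal name test described above can be sketched as follows. This is an illustrative sketch only, not the implementation of the invention: the function names, the fixed set of component orderings (month/day/year, month/year, month/day, year alone), the two-digit-year pivot, and the plausible year range of 1900 to 2099 are all assumptions introduced for the example.

```python
import calendar
from itertools import product

# Allowed digit widths per component: month and day use 1-2 digits, year 2 or 4.
WIDTHS = {"M": (1, 2), "D": (1, 2), "Y": (2, 4)}


def expand_year(text, pivot):
    """Expand a 2-digit year around a pivot (assumed rule); keep 4-digit years."""
    value = int(text)
    if len(text) == 4:
        return value
    return 2000 + value if value < pivot else 1900 + value


def temporal_names(digits, orders=("MDY", "MY", "MD", "Y"),
                   pivot=30, min_year=1900, max_year=2099):
    """Return the set of (year, month, day) readings of a digit string.

    Components that a reading does not supply are None.
    """
    found = set()
    if not (digits.isdigit() and 2 <= len(digits) <= 8):
        return found
    for order in orders:
        # Try every combination of component widths that uses all digits.
        for widths in product(*(WIDTHS[c] for c in order)):
            if sum(widths) != len(digits):
                continue
            parts, pos = {}, 0
            for component, width in zip(order, widths):
                parts[component] = digits[pos:pos + width]
                pos += width
            year = expand_year(parts["Y"], pivot) if "Y" in parts else None
            month = int(parts["M"]) if "M" in parts else None
            day = int(parts["D"]) if "D" in parts else None
            if year is not None and not min_year <= year <= max_year:
                continue
            if month is not None and not 1 <= month <= 12:
                continue
            if day is not None:
                if month is None:
                    last_day = 31
                else:
                    # With no year, assume a leap year so Feb 29 stays possible.
                    check_year = year if year is not None else 2000
                    last_day = calendar.monthrange(check_year, month)[1]
                if not 1 <= day <= last_day:
                    continue
            found.add((year, month, day))
    return found
```

Under these assumptions, `temporal_names("11176")` yields both the Jan. 11, 1976 and Nov. 1, 1976 readings, while `temporal_names("88888")` yields none. A string such as “1123” produces the “November 23” reading along with other candidate readings, which would be left to the later ranking stage.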

The extraction of information from handles based on the principles set forth above may be described with reference to the flow chart of FIG. 1. A handle 10 is input into the process. The extraction begins with a preprocessing step that attempts to clean the handle of punctuation that is often used as filler or decorations as well as the domain suffix if it is an email address. As mentioned earlier, often sentinel characters are used to embellish parts of the handle. Such examples include “R.A.C.H.E.A.L.” and “S*U*Z**I*E*!” The result of this pre-processing is the de-noised handle 12. The removal of such sentinel characters often simplifies the identification of the different name phrases, but some sentinel characters such as “/” can act as meaningful contextual hints within phrases (such as temporal names) that disambiguate the meaning. The set of such potential sentinel characters to be removed can be adjusted within the process, but extra care is taken in the development of the name dictionaries and logic for the identification of temporal names to account for the existence of disambiguating sentinel characters after this pre-processing step.
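A minimal sketch of such a pre-processing step is shown below. The function name `denoise`, the default set of removable sentinel characters, and the lowercasing of the result are assumptions made for illustration; the description only requires that the removable set be adjustable and that disambiguating characters such as “/” can be retained.

```python
def denoise(handle, removable=".*!_-", lowercase=True):
    """Strip an email domain suffix and the configured sentinel characters.

    Characters outside `removable` (for example "/") are kept, since they
    may disambiguate temporal names later in the process.
    """
    # Keep only the part before the first "@", if the handle is an email address.
    local_part = handle.split("@", 1)[0]
    cleaned = "".join(ch for ch in local_part if ch not in removable)
    # Lowercasing is an assumed convenience for dictionary lookups.
    return cleaned.lower() if lowercase else cleaned
```

With these defaults, `denoise("S*U*Z**I*E*!")` gives `"suzie"` and `denoise("josephvee@hotmail.com")` gives `"josephvee"`.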

Once the handle is prepared for subsequent processing, each of the different phrase types is searched for in the string independent of the other types, and the identified phrases are marked both in terms of the phrase type and exact location of the phrase at marking step 14. The resulting sets of marked phrases are independent of the order of the types searched. For example, in the handle “suzylovesbill4ever” each of “suzy,” “bill,” and “eve” will be marked as potential first names, “love” will be marked as both a potential last name and vanity name, and “loves” will be marked as a potential connective. Marking step 14 utilizes the databases 16 that include one or more databases in the categories last names database 18, first/middle names database 20, dates database 22, entity names database 24, vanity names database 26, and connectives database 28. In certain embodiments, first/middle names database 20 may be divided into multiple databases, such as a traditional first/middle name database and a recent first/middle name database, which may be separately marked in marking step 14.
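The marking step can be illustrated with a small sketch. The toy dictionaries, the type labels (`FN`, `LN`, `VN`, `CONN`), and the tuple layout `(type, start, end, text)` are assumptions for this example; they stand in for the much larger databases 18 through 28.

```python
# Toy stand-ins for the name databases; real tables would be far larger.
TOY_DICTIONARIES = {
    "FN": {"suzy", "bill", "eve"},   # first/middle names
    "LN": {"love"},                  # last names
    "VN": {"love"},                  # vanity names
    "CONN": {"loves", "and"},        # connectives
}


def mark_phrases(handle, dictionaries):
    """Mark every occurrence of every dictionary word in the handle.

    Each mark is (phrase_type, start, end, text), with end exclusive.
    Each type is searched independently of the others, and the result is
    sorted so that it does not depend on the order of the search.
    """
    marks = []
    for phrase_type, words in dictionaries.items():
        for word in words:
            start = handle.find(word)
            while start != -1:
                marks.append((phrase_type, start, start + len(word), word))
                start = handle.find(word, start + 1)
    return sorted(marks, key=lambda m: (m[1], m[2], m[0]))
```

For the handle `"suzylovesbill4ever"`, this sketch marks “suzy,” “bill,” and “eve” as first names, “love” as both a last name and a vanity name, and “loves” as a connective, each with its character span.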

In order to control the exponential growth of the process of combining different, possibly overlapping instances of phrase types into multiple full handle interpretations, the systematic construction of the interpretation of the handle must not over-populate the growing set of interpretations with logically sound but conceptually meaningless instances, yet it must preserve the vast majority of meaningful instances. This process addresses this issue by first ordering the phrase types at phrase type extraction sequence 30 and progressively constructing the interpretations of the handle in the prescribed order. Any possible ordering of the phrase types is allowable, although specific ones can perform better in certain social media contexts as the frequencies and combinations of the different phrase types can vary greatly across social media platforms. One possible ordering is vanity names, last names, first/middle names, entity names, dates, and connectives. The source of the handle may be used as an input to determine order. For example, where two separate databases are used for traditional first/middle names and recent first/middle names, precedence may be given to the recent first/middle name database where the source of the handle would indicate that the associated individual is more likely to be a young adult.

Prior to the start of the interpretation phase a “maximum allowable overlap” for two substrings is set. This value is important in the interpretation of adjacent phrases as sometimes two phrases intentionally share some consecutive characters. For example, in the string “susanick” the first names “susan” and “nick” share a single character, and without an allowable overlap of at least one character both names would not be included in a common interpretation. Once this overlap value is set the interpretation stage begins with an “empty” interpretation.

Starting with the first chosen phrase type each of its identified words is added one or more times to the growing collection of interpretations. If the word can be added to one or more of the existing interpretations without exceeding the maximum allowable overlap for each phrase in the interpretation that addition is performed and the next word is processed. If the word cannot be added to any existing interpretation, a new interpretation is created with that word and is added to the interpretation collection. Subsequently, each of the remaining phrase types is processed in the same fashion until all phrase types have contributed to the interpretation of the handle.
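One reading of this construction procedure is sketched below, assuming marks of the form `(phrase_type, start, end, text)` with an exclusive end index. The function names are hypothetical, and the choice to extend existing interpretations in place (rather than copying them) is an assumption, since the description does not specify it.

```python
def overlap(a, b):
    """Number of characters shared by the spans of two marks."""
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def build_interpretations(marks, type_order, max_overlap=0):
    """Grow handle interpretations one phrase type at a time."""
    interpretations = [[]]  # start from the "empty" interpretation
    for phrase_type in type_order:
        for mark in (m for m in marks if m[0] == phrase_type):
            placed = False
            for interp in interpretations:
                # Add the word wherever it respects the overlap limit
                # against every phrase already in that interpretation.
                if all(overlap(mark, other) <= max_overlap for other in interp):
                    interp.append(mark)
                    placed = True
            if not placed:
                # The word fits nowhere, so it seeds a new interpretation.
                interpretations.append([mark])
    return [sorted(interp, key=lambda m: m[1]) for interp in interpretations]
```

For “susanick”, the marks for “susan” (characters 0 to 5) and “nick” (characters 4 to 8) end up in separate interpretations when `max_overlap=0`, and in a single common interpretation when `max_overlap=1`.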

Prior to creating the ranking of the interpretation results, the collection of the constructed handle interpretations is filtered at filtering step 32 by first expanding the collection and then trimming the collection of “redundant” instances according to filtering rules 34. The expansion is accomplished by creating new instances from each of the existing ones by labeling each found phrase type instance by all of the phrase types for that instance. For example, in the string “suzylovesbill4ever” one resulting interpretation is [FN:suzy, LN:love, FN:bill, FN:eve]. If connectives were extracted after last names then “love” would not have been interpreted as such. So the additional interpretation [FN:suzy, CONN:love, FN:bill, FN:eve] would be added to the collection. In order to minimize the size of the collection there are two different types of “redundancy” that should be addressed. The first type is the obvious one, namely the removal of duplicate interpretations that can potentially be formed in the aforementioned construction and expansion process. The second redundancy occurs when one interpretation is properly contained in another interpretation (both in terms of the phrase types and substrings). For example, for the handle “suzylovesbill” the interpretation [FN:suzy, CONN:love, FN:bill] is completely contained in the interpretation [FN:suzy, CONN:loves, FN:bill] and hence will be removed from the collection, retaining the latter interpretation. However, the interpretation [FN:suzy, LN:love, FN:bill] would not be removed as the overlapping strings “love” and “loves” are interpreted as different types of phrases.
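The trimming portion of this filtering step (not the expansion) can be sketched as follows, again assuming marks of the form `(phrase_type, start, end, text)`. The function names and the span-based containment test are assumptions chosen to reproduce the “suzylovesbill” example; they are not presented as the filtering rules 34 themselves.

```python
def contained_in(a, b):
    """True if interpretation a is properly contained in interpretation b.

    Every phrase of a must lie inside a phrase of b of the same type.
    """
    return a != b and all(
        any(t == bt and bs <= s and e <= be for (bt, bs, be, _) in b)
        for (t, s, e, _) in a
    )


def filter_interpretations(interpretations):
    """Drop duplicate interpretations, then properly contained ones."""
    # Canonical tuples make duplicates compare equal; dict keys keep order.
    canonical = [
        tuple(sorted(interp, key=lambda m: (m[1], m[2], m[0])))
        for interp in interpretations
    ]
    unique = list(dict.fromkeys(canonical))
    return [
        interp for interp in unique
        if not any(contained_in(interp, other) for other in unique)
    ]
```

Applied to the three interpretations of “suzylovesbill” discussed above, the sketch removes [FN:suzy, CONN:love, FN:bill] and retains both [FN:suzy, CONN:loves, FN:bill] and [FN:suzy, LN:love, FN:bill].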

The resulting interpretations are then ranked at ranking step 36 based on user-defined criteria stored in optimality rules 38, where the higher ranked interpretations embody these criteria more directly than the lower ranked instances. Types of criteria and their impact on the ranking include the following:

(a) percentage of the handle that was interpreted in terms of the phrase types: adjust the ranked score by an associated percentage increase or decrease relative to a statically fixed or dynamically computed minimally expected percentage;

(b) three or more identified last names: decrease the ranked score by a significant proportion unless an appropriate number and location of first names or vanity names were identified; and

(c) connectives with no name phrase on one or both sides of the connective: decrease the ranked score by a significant proportion.
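A hedged sketch of how criteria (a) through (c) might be combined into a score is given below. All numeric values (the base score of 1.0, the minimum expected coverage of 0.5, and the penalty factor of 0.5) are assumptions, as are the function names. Criterion (b) is simplified here to waive the penalty whenever any first name or vanity name is present, without checking the number or location of those names.

```python
NAME_TYPES = {"FN", "LN"}


def score(interp, handle_length, min_coverage=0.5, penalty=0.5):
    """Score one interpretation; marks are (type, start, end, text)."""
    phrases = sorted(interp, key=lambda m: m[1])
    covered = set()
    for _, start, end, _ in phrases:
        covered.update(range(start, end))
    coverage = len(covered) / handle_length
    # (a) reward or punish coverage relative to the expected minimum.
    value = 1.0 + (coverage - min_coverage)
    types = [p[0] for p in phrases]
    # (b) three or more last names with no first or vanity name present.
    if types.count("LN") >= 3 and not {"FN", "VN"} & set(types):
        value *= penalty
    # (c) each connective lacking a name phrase on one or both sides.
    for i, phrase_type in enumerate(types):
        if phrase_type == "CONN":
            left = i > 0 and types[i - 1] in NAME_TYPES
            right = i + 1 < len(types) and types[i + 1] in NAME_TYPES
            if not (left and right):
                value *= penalty
    return value


def rank(interpretations, handle_length):
    """Order interpretations from highest to lowest score."""
    return sorted(interpretations,
                  key=lambda interp: score(interp, handle_length),
                  reverse=True)
```

Under these assumed weights, the interpretation [FN:suzy, CONN:loves, FN:bill] of “suzylovesbill” outranks an interpretation that contains only [CONN:loves, FN:bill], because the latter covers less of the handle and has a connective with no name phrase on its left.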

Upon completion of the ranking adjustments, initials can be added at the beginning of found last names or at the end of first names if the appropriate adjacent character has not been included in a phrase in the interpretation.

Any portion of the resulting sorted/ranked list of interpretations can be used to identify an individual, or alternatively sufficient information to determine certain characteristics about this individual, with extracted data 40. This information may be used to search a consumer database 42 in order to collect more data about the individual associated with the handle. For example, if enough name information is found to associate the name with a particular individual consumer, then this information may be used to connect all known information about a particular individual that is found in a general consumer information database. This information is returned as marketing data 44. Otherwise, information that is uncovered may be used to gain additional insights about the individual in order to better prepare a targeted marketing message.

The system for implementing the steps of FIG. 1 in certain embodiments of the present invention is a computing device 500 as illustrated in FIG. 2, which is programmed by means of instructions to result in a special-purpose computing device to perform the various functionality described herein. Computing device 500 may be physically implemented in a number of different forms. For example, it may be implemented as a standard computer server as shown in FIG. 2 or as a group of servers, operating either as serial or parallel processing machines.

Computing device 500 includes in the server example of FIG. 2 microprocessor 502, memory 504, an input/output device or devices such as display 506, and storage device 508, such as a solid-state drive or magnetic hard drive. Each of these components is interconnected using various buses or networks, and several of the components may be mounted on a common PC board or in other manners as appropriate.

Microprocessor 502 may execute instructions within computing device 500, including instructions stored in memory 504. Microprocessor 502 may be implemented as a single microprocessor or multiple microprocessors, which may be either serial or parallel computing microprocessors.

Memory 504 stores information within computing device 500. The memory 504 may be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units such as RAM, or a non-volatile memory unit or units such as flash memory or ROM. Memory 504 may be partially or wholly integrated within microprocessor 502, or may be an entirely stand-alone device in communication with microprocessor 502 along a bus, or may be a combination such as on-board cache memory in conjunction with separate RAM memory. Memory 504 may include multiple levels with different levels of memory 504 operating at different read/write speeds, including multiple-level caches as are known in the art.

Display 506 provides for interaction with a user, and may be implemented, for example, as an LED (light emitting diode) or LCD (liquid crystal display) monitor for displaying information to the user, in addition to a keyboard and a pointing device, for example, a mouse, by which the user may provide input to the computer. Other kinds of devices may be used to provide for interaction with a user as well.

Various implementations of the systems and methods described herein may be realized in computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable microprocessor 502, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, one or more input devices, and one or more output devices.

The computing system can include a consumer computing device, such as a desktop computer, laptop computer, tablet, smartphone, or embedded device. In the example of FIG. 2, a desktop computer is shown. In this case, client device 512 is the consumer computing device, and runs a web browser 514 in order to access the Internet 510, which allows interconnection with computing device 500. A client and server are generally remote from each other and typically interact through a communication network. Client device 512 may be the source of a handle for processing as described herein, such as when a user is engaging in communications over social media or sending a request for more information through a website operated by a retailer that wishes to send a targeted marketing message to the individual.

Unless otherwise stated, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the present invention, a limited number of the exemplary methods and materials are described herein. It will be apparent to those skilled in the art that many more modifications are possible without departing from the inventive concepts herein.

All terms used herein should be interpreted in the broadest possible manner consistent with the context. In particular, the terms “comprises” and “comprising” should be interpreted as referring to elements, components, or steps in a non-exclusive manner, indicating that the referenced elements, components, or steps may be present, or utilized, or combined with other elements, components, or steps that are not expressly referenced. When a grouping is used herein, all individual members of the group and all combinations and sub-combinations possible of the group are intended to be individually included. All references cited herein are hereby incorporated by reference to the extent that there is no inconsistency with the disclosure of this specification.

The present invention has been described with reference to certain preferred and alternative embodiments that are intended to be exemplary only and not limiting to the full scope of the present invention, as set forth in the appended claims.

1. A computer-implemented method for extracting information about an individual from a handle, comprising the steps of: a. receiving the handle at a processor; b. marking name phrase types in the handle to create a collection of name phrase types; c. extracting one or more name phrase types from the marked collection of name phrase types to produce one or more handle interpretations; d. filtering the one or more handle interpretations to eliminate any redundant handle interpretations according to a set of filtering rules; and e. ranking the one or more handle interpretations utilizing a set of optimality rules.

2. The computer-implemented method for extracting information about an individual from a handle of claim 1, wherein the marking step is performed by parsing the handle based on a name database.

3. The computer-implemented method for extracting information about an individual from a handle of claim 2, wherein the name database comprises one or more of a last name database, a first name/middle name database, a dates database, an entity names database, a vanity names database, and a connectives database.

4. The computer-implemented method for extracting information from an individual handle of claim 3, wherein the name database comprises a male first name database and a female first name database.

5. The computer-implemented method for extracting information from an individual handle of claim 3, wherein the first name database comprises a traditional first name database and a recent first name database.

6. The computer-implemented method for extracting information from an individual handle of claim 5, wherein the first name portion of the marking step is performed with preference to the traditional first name database for a first set of handle sources, and the recent first name database for a second set of handle sources.

7. The computer-implemented method for extracting information from an individual handle of claim 1, further comprising the step of identifying any sentinel characters and removing at least a part of the identified sentinel characters in the handle prior to the marking step.

8. The computer-implemented method for extracting information from an individual handle of claim 7, further comprising the step of identifying and removing only non-disambiguation sentinel characters.

9. The computer-implemented method for extracting information from an individual handle of claim 1, further comprising the step of identifying a maximum string overlap and utilizing the overlap in the extraction step.

10. The computer-implemented method for extracting information from an individual handle of claim 1, wherein the filtering step comprises the step of removing duplicate interpretations.

11. The computer-implemented method for extracting information from an individual handle of claim 10, wherein the filtering step further comprises the step of removing any interpretation that is wholly contained within another interpretation.

12. The computer-implemented method for extracting information from an individual handle of claim 1, wherein the ranking step comprises the step of ranking interpretations based on a percentage of the handle that was interpreted in the one or more handle interpretations.

13. The computer-implemented method for extracting information from an individual handle of claim 1, wherein the ranking step comprises the step of ranking interpretations based on a number of last names found in the one or more handle interpretations.

14. The computer-implemented method for extracting information from an individual handle of claim 1, wherein the ranking step comprises the step of ranking interpretations based on the presence of a connective in each of the interpretations without a last name or first name/middle name adjacent to the connective.